{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "source": [
    "# \ud83d\udcdd Exercise M4.04\n",
    "\n",
    "In the previous Module we tuned the hyperparameter `C` of the logistic\n",
    "regression without mentioning that it controls the regularization strength.\n",
    "Later, on the slides on \ud83c\udfa5 **Intuitions on regularized linear models** we\n",
    "mentioned that a small `C` provides a more regularized model, whereas a\n",
    "non-regularized model is obtained with an infinitely large value of `C`.\n",
    "Indeed, `C` behaves as the inverse of the `alpha` coefficient in the `Ridge`\n",
    "model.\n",
    "\n",
    "In this exercise, we ask you to train a logistic regression classifier using\n",
    "different values of the parameter `C` to find its effects by yourself.\n",
    "\n",
    "We start by loading the dataset. We only keep the Adelie and Chinstrap classes\n",
    "to keep the discussion simple."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p class=\"last\">If you want a deeper overview regarding this dataset, you can refer to the\n",
    "Appendix - Datasets description section at the end of this MOOC.</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "penguins = pd.read_csv(\"../datasets/penguins_classification.csv\")\n",
    "penguins = (\n",
    "    penguins.set_index(\"Species\").loc[[\"Adelie\", \"Chinstrap\"]].reset_index()\n",
    ")\n",
    "\n",
    "culmen_columns = [\"Culmen Length (mm)\", \"Culmen Depth (mm)\"]\n",
    "target_column = \"Species\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "penguins_train, penguins_test = train_test_split(\n",
    "    penguins, random_state=0, test_size=0.4\n",
    ")\n",
    "\n",
    "data_train = penguins_train[culmen_columns]\n",
    "data_test = penguins_test[culmen_columns]\n",
    "\n",
    "target_train = penguins_train[target_column]\n",
    "target_test = penguins_test[target_column]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We define a function to help us fit a given `model` and plot its decision\n",
    "boundary. We recall that by using a `DecisionBoundaryDisplay` with diverging\n",
    "colormap, `vmin=0` and `vmax=1`, we ensure that the 0.5 probability is mapped\n",
    "to the white color. Equivalently, the darker the color, the closer the\n",
    "predicted probability is to 0 or 1 and the more confident the classifier is in\n",
    "its predictions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "from sklearn.inspection import DecisionBoundaryDisplay\n",
    "\n",
    "\n",
    "def plot_decision_boundary(model):\n",
    "    model.fit(data_train, target_train)\n",
    "    accuracy = model.score(data_test, target_test)\n",
    "    C = model.get_params()[\"logisticregression__C\"]\n",
    "\n",
    "    disp = DecisionBoundaryDisplay.from_estimator(\n",
    "        model,\n",
    "        data_train,\n",
    "        response_method=\"predict_proba\",\n",
    "        plot_method=\"pcolormesh\",\n",
    "        cmap=\"RdBu_r\",\n",
    "        alpha=0.8,\n",
    "        vmin=0.0,\n",
    "        vmax=1.0,\n",
    "    )\n",
    "    DecisionBoundaryDisplay.from_estimator(\n",
    "        model,\n",
    "        data_train,\n",
    "        response_method=\"predict_proba\",\n",
    "        plot_method=\"contour\",\n",
    "        linestyles=\"--\",\n",
    "        linewidths=1,\n",
    "        alpha=0.8,\n",
    "        levels=[0.5],\n",
    "        ax=disp.ax_,\n",
    "    )\n",
    "    sns.scatterplot(\n",
    "        data=penguins_train,\n",
    "        x=culmen_columns[0],\n",
    "        y=culmen_columns[1],\n",
    "        hue=target_column,\n",
    "        palette=[\"tab:blue\", \"tab:red\"],\n",
    "        ax=disp.ax_,\n",
    "    )\n",
    "    plt.legend(bbox_to_anchor=(1.05, 0.8), loc=\"upper left\")\n",
    "    plt.title(f\"C: {C} \\n Accuracy on the test set: {accuracy:.2f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's now create our predictive model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "logistic_regression = make_pipeline(StandardScaler(), LogisticRegression())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Influence of the parameter `C` on the decision boundary\n",
    "\n",
    "Given the following candidates for the `C` parameter and the\n",
    "`plot_decision_boundary` function, find out the impact of `C` on the\n",
    "classifier's decision boundary.\n",
    "\n",
    "- How does the value of `C` impact the confidence on the predictions?\n",
    "- How does it impact the underfit/overfit trade-off?\n",
    "- How does it impact the position and orientation of the decision boundary?\n",
    "\n",
    "Try to give an interpretation on the reason for such behavior."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Cs = [1e-6, 0.01, 0.1, 1, 10, 100, 1e6]\n",
    "\n",
    "# Write your code here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Impact of the regularization on the weights\n",
    "\n",
    "Look at the impact of the `C` hyperparameter on the magnitude of the weights.\n",
    "**Hint**: You can [access pipeline\n",
    "steps](https://scikit-learn.org/stable/modules/compose.html#access-pipeline-steps)\n",
    "by name or position. Then you can query the attributes of that step such as\n",
    "`coef_`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Impact of the regularization on with non-linear feature engineering\n",
    "\n",
    "Use the `plot_decision_boundary` function to repeat the experiment using a\n",
    "non-linear feature engineering pipeline. For such purpose, insert\n",
    "`Nystroem(kernel=\"rbf\", gamma=1, n_components=100)` between the\n",
    "`StandardScaler` and the `LogisticRegression` steps.\n",
    "\n",
    "- Does the value of `C` still impact the position of the decision boundary and\n",
    "  the confidence of the model?\n",
    "- What can you say about the impact of `C` on the underfitting vs overfitting\n",
    "  trade-off?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.kernel_approximation import Nystroem\n",
    "\n",
    "# Write your code here."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}